Structural Components Revealed by Evaluating the
Quality of Elementary School Tests*


Joseph M. Petrosko 
University of Louisville 


and 


Esther Shani
Jerusalem, Israel


* Paper presented at the annual meeting of the National Council on
Measurement in Education, New York City, April 1977.


This study was conducted to test a theory and, on a more practical level, to report results useful to people who must select tests. Finding the appropriate test for a particular educational purpose presents problems for many researchers. Difficult judgments are involved. Although lip service is paid to the principle that selecting a standardized test requires a thorough consideration of many factors, in fact, the principle is rarely followed. And not surprisingly so. There are thousands of tests on the market, and trying to judge which one might be best for a particular purpose is simply beyond the resources of many test users.

In order to provide a simple-to-use but detailed quality guide in this area, a comprehensive rating system for the evaluation of tests has been developed (Hoepfner, Conniff, Petrosko, Watkins, Erlich, Todaro, Hoyt, McGuire, Klibanoff, Stangel, Lee, Rest, Hufano, Bastone, Ogilvie, Hunter, & Johnson, 1974; Hoepfner, Stern, & Nummedal, 1971; Hoepfner, Strickland, Stangel, Jansen, & Patalino, 1970). Using this system, numerical ratings of educational and psychometric quality can be used to compare standardized tests. The ratings reflect criteria grouped into four general areas of test quality: Measurement Validity, Examinee Appropriateness, Administrative Usability, and Normed Technical Excellence (yielding an acronym for the evaluation system - MEAN).


With some variations, the procedure used in implementing the MEAN system was similar each time that it was applied. The evaluation process was initiated by the acquisition of virtually all published tests at the relevant grade levels. Tests were then categorized into educational goal areas and evaluated against the MEAN rating scales. At least two persons, working independently, performed the ratings. (A third rater was used when there were disagreements between the first two.) The final outcome was the publication of the evaluations in books available to test users.


The ratings were primarily conceived as a source of comparative information on those tests which were designed to measure the same general outcome. Using this "consumer's guide," a person could, for example, compare various tests in reading comprehension with one another. After examining the strengths and weaknesses of various instruments, a selection could be made of a test suitable for a given set of educational circumstances.


The ratings are also useful for another purpose, however. They can be used to examine the quality of tests in general and to discover how the various elements of test quality relate to one another. Questions like these can be addressed: How do the rated validity, reliability, and score distribution characteristics of tests relate to one another? Are reliable tests valid? Are tests with good norms also possessed of a good physical format?

As a vehicle for answering questions like these, Esther Shani proposed a theory for the quality structure of standardized tests (Shani & Petrosko, 1976). Using data from evaluations of secondary school tests (Hoepfner et al., 1974), the theory successfully predicted a structural configuration to explain the correlations of quality ratings.

To explore the generalizability of this theory, the present study was undertaken. Correlations obtained from elementary level tests (Hoepfner et al., 1970) were analyzed to determine if the theory developed for secondary level tests would still be applicable to another age level. The analysis could also provide useful information for test users.


METHOD

The Theory

The theory employed in this study follows directly from the study of Shani & Petrosko (1976). Adaptations were made, where necessary, to reflect the differences between the MEAN evaluation system as employed with elementary tests and as employed with secondary tests.

Given the requirements of the study--conceptualizing and relating a number of variables to one another--the obvious need was for a technique of conceptualization and analysis suitable for a set of multivariate data. Facet theory, developed by Guttman (1965), offered the advantage of a well developed method for linguistically processing the many variables involved and also providing a link to an analytic technique for mathematical processing of the data. Facet theory has been applied to such content areas as attitude measurement (Mori, 1965) and intelligence testing (Schlesinger & Guttman, 1969) and is a general approach to research applicable to any content area where sets of variables can be identified in terms of more basic sets or facets.

Examination of the MEAN test evaluation criteria for elementary school tests revealed an emergent theory about a structure for evaluating standardized tests. The overall outlines of this theory might be drawn by asking two questions: (a) What are the components of a test evaluation that are inherent in the construction and development of the test? (b) What is the relationship of the test development process to the examinee?

The two considerations could be expressed as two facets (or sets).

A  Components of a test evaluation inherent in the test and its development. (Five elements)
   a1  Theoretical conceptualization of the test
   a2  Characteristics and format of items
   a3  Test instructions
   a4  Empirically determined validity and reliability
   a5  Test scores and norms

B  Relationship of the test development process to the examinee. (Three elements)
   b1  Initial test construction activities
   b2  Standardization and refinement of a test through sampling from a population
   b3  Direct contact between the test and examinee

The elements of facet A are assumed independent of one another and, therefore, define it as a polarizing facet. Criteria related to these elements would emerge as independent factors in a factor analysis. Facet B could be defined as an ordered facet, with each element showing a different degree of relationship between the examinee and the test's development.

The elements of facet A relate to independent aspects of a standardized test about which quality assessment can be made. For example, one can ask the question: what evidence does a particular test present that sufficient efforts were taken in its theoretical conceptualization (a1) or in the way items were written (a2) to operationalize the theoretical conceptualization? Similar questions can be asked about the remaining three elements.


Facet B contains three elements, all related to the degree to which the development of a test relates to an examinee. Element b1, Initial test construction activities, has the smallest relationship between an individual examinee and the test's development. During item writing activities, the authors typically have no specific individual in mind, and construct items for a broad spectrum of examinee types within the general constraints of the age level intended. Standardization and refinement of a test through sampling from a population, element b2, involves activities which are more closely related to an individual examinee who will eventually take the published test. For example, validity and reliability studies carried out by a test developer would be associated with this element. Finally, element b3 shows the closest relationship between the test and the test-taker. Direct contact between the test and examinee generally involves aspects of the actual test-taking situation, e.g., format and clarity of items. In summary, the elements of facet B may be considered to lie on a continuum spanning the degree of relationship between a test developer and a person actually taking a developed test.

A structure for evaluating a standardized test in terms of facets A and B can be defined in the following mapping sentence:

The quality of test (x) with respect to components
   a1  Theoretical conceptualization
   a2  Item characteristics and format
   a3  Test instructions
   a4  Empirical validity and reliability
   a5  Test scores and norms
at the
   b1  least (initial construction)
   b2  medium (standardization on population)
   b3  highest (direct contact)
level of relationship with the examinee -> very high to very low quality.

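
Because each evaluation criterion is assigned a profile made of one element from each facet, the full set of possible profiles is the Cartesian product of facets A and B. The following is a minimal sketch of that enumeration, offered as an illustration and not as part of the original study. The names FACET_A, FACET_B, structuples, and describe are hypothetical; the element labels are copied from the facet definitions.

```python
from itertools import product

# Facet A: components of a test evaluation inherent in the test
# and its development (labels taken from the facet definitions).
FACET_A = {
    "a1": "Theoretical conceptualization of the test",
    "a2": "Characteristics and format of items",
    "a3": "Test instructions",
    "a4": "Empirically determined validity and reliability",
    "a5": "Test scores and norms",
}

# Facet B: relationship of the test development process to the
# examinee, ordered from least (b1) to highest (b3).
FACET_B = {
    "b1": "Initial test construction activities",
    "b2": "Standardization and refinement through sampling",
    "b3": "Direct contact between the test and examinee",
}


def structuples(facet_a, facet_b):
    """Return every profile (one element per facet), e.g. 'a4b2'."""
    return [a + b for a, b in product(facet_a, facet_b)]


def describe(profile):
    """Spell out a profile such as 'a4b2' using the facet labels."""
    a_key, b_key = profile[:2], profile[2:]
    return f"{FACET_A[a_key]}; {FACET_B[b_key]}"


profiles = structuples(FACET_A, FACET_B)
print(len(profiles))      # number of possible profiles (5 x 3)
print(describe("a4b2"))
```

Five elements crossed with three elements give fifteen possible profiles; a given set of criteria need not use all of them.
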
According to Guttman (1970), concepts dealt with by two facets, one of which is polarizing and the other ordered, tend to show a radex structure in the analysis of empirical data based on the facets. It was hypothesized that analysis of data from the MEAN evaluations of elementary school tests would yield such a radex structure.

The analysis that was seen as most appropriate was Guttman's Smallest Space Analysis (SSA). The latter, as are several other nonmetric multidimensional scaling techniques, is based upon a simple principle: the higher the correlation between two variables, the smaller is the represented distance between two points representing the variables. If $r_{12} > r_{34}$, then $d_{12} < d_{34}$, where $r$ = correlation coefficient and $d$ = distance in space.
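
The ordering principle can be checked directly for any proposed configuration of points. The sketch below is an illustration only and is not the SSA-I algorithm: it simply counts the pairs of variable pairs for which a higher correlation is not matched by a strictly smaller distance. The function name and the three-variable data are hypothetical.

```python
from itertools import combinations
from math import dist


def monotonicity_violations(corr, coords):
    """Count cases where r_ij > r_kl but d_ij is not smaller than d_kl.

    corr   -- square matrix (list of lists) of correlations
    coords -- one coordinate tuple per variable
    """
    pairs = list(combinations(range(len(coords)), 2))
    violations = 0
    for (i, j), (k, l) in combinations(pairs, 2):
        r1, r2 = corr[i][j], corr[k][l]
        d1 = dist(coords[i], coords[j])
        d2 = dist(coords[k], coords[l])
        # The principle requires the more highly correlated pair
        # to be strictly closer together.
        if (r1 > r2 and d1 >= d2) or (r2 > r1 and d2 >= d1):
            violations += 1
    return violations


# Hypothetical correlations among three variables.
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

# A configuration that respects the ordering, and one that does not.
good = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
bad = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0)]

print(monotonicity_violations(corr, good))
print(monotonicity_violations(corr, bad))
```

A configuration with no violations reproduces the rank order of the correlations exactly; fit coefficients such as stress and alienation summarize how far a solution departs from that ideal.
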


Test Evaluation Criteria

For the elementary school test evaluations, tests were acquired and evaluated for grades 1, 3, 5 and 6. Trained raters used only the specimen tests and other supporting material sent by the publisher. Each test was first categorized into one of 145 goals of elementary education. These goals constituted a comprehensive taxonomy of elementary education in terms of student outcomes. After this categorization, evaluators rated the test on the 24 criteria of the MEAN system. For each criterion, each test was awarded zero to a specified number of points, depending on its possessing the desired trait in question.


Table 1 shows each criterion, its facet profile (each criterion being a structuple of facets A and B), and the range of possible points a test could receive for the criterion. Complete descriptions of the criteria are contained in Hoepfner et al. (1970). It might be noted that the criteria differ somewhat from those used with secondary school tests and analyzed by Shani and Petrosko (1976).


Table 1

Elementary Test Evaluation Criteria with Facet Profiles
and Ranges of Points Awarded

Profile   Criterion                                  Range
a?b?       1. Content/Construct Validity             0-10
a?b2       2. Concurrent/Predictive Validity         0-5
a?b?       3. Content Comprehension                  0-4
a?b?       4. Instructions Comprehension             0-4
a?b?       5. Visual Format                          0-2
a?b?       6. Quality of Illustrations               0-1
a?b?       7. Time and Pacing                        0-1
a3b?       8. Response Recording                     0-2
a?b?       9. Test Administration (Group)            0-2
a?b?      10. Training of Administrators             0-1
a?b?      11. Administration (Time)                  0-1
a?b?      12. Scoring                                0-2
a5b?      13. Norm Range                             0-1
a5b?      14. Score Interpretability                 0-1
a?b?      15. Score Conversion                       ?
a?b?      16. Norm Representativeness                0-1
a?b?      17. Score Interpreter                      0-1
a?b?      18. Decision-Making Utility                0-3
a4b2      19. Test-Retest Reliability                0-5
a4b?      20. Internal Consistency Reliability       0-3
a4b2      21. Alternate Form Reliability             0-3
a?b?      22. Replicability                          0-1
a?b?      23. Range of Coverage                      0-3
a5b?      24. Gradation of Scores                    ?

The first two ratings deal with validity. Content and construct validity referred to whether the test measured the specific educational objective that the test was categorized under. Concurrent and predictive validity referred to evidence that such validity studies had been performed.


Evaluation criteria 3 through 8 were related to the general theme of Examinee Appropriateness. Content Comprehension and Instructions Comprehension dealt with the perceived clarity of the items themselves and of the test's overall instructions. The criteria Visual Format and Quality of Illustrations had to do with physical arrangement of items on the page and quality of printing and graphics. Time and Pacing required a judgment about whether an instrument was a power test or was unnecessarily speeded. Response Recording related to whether there was a simple and direct connection between the item stem and the recording of a response.

The next set of evaluation criteria, 9 through 18, fell under the general area of Administrative Usability. Criterion 9, Test Administration, gave tests a positive rating if they were designed for group rather than individual or small group administration. Training of Administrators was used to downgrade those tests requiring a psychometrist to administer. The criterion Administration (Time) credited those tests that could be administered in a typical class period of time. Criterion 12, Scoring, gave tests optimal points for simple and objective scoring procedures. Norm Range was used to evaluate if the norm sample was broad in age range. Score Interpretability related to whether converted scores were of a well known type (e.g., percentiles). Score Conversion gave credit to tests with a simple conversion procedure from raw score to standard score.
Norm Representativeness credited those tests with well represented norm samples from the student population. Score Interpreter gave a point to tests that could be interpreted by the school's staff. Decision-Making Utility related to how prescriptive a decision could be made about a student (the more prescriptive, the better).


The last set of criteria were in the area Normed Technical Excellence. Criteria 19 through 21 were used to give tests credit if they reported high coefficients of Test-Retest, Internal Consistency and Alternate Form Reliability. The criterion Replicability gave tests more credit for replicable procedures for obtaining scores. The Range of Coverage criterion was used to award points to those instruments aimed at providing information for a wide range of some behavior. Finally, Score Gradation gave tests maximal credit for useful converted scores such as centiles rather than crude scores like pass/fail.

Analysis 


A 24 x 24 matrix of correlations was derived from a report by aooren (1971). The matrix was generated by correlating ratings on each criterion with one another. Ratings for sixth grade tests (N = 508) were analyzed. The matrix was used as input for the multidimensional scaling program, SSA-I (Guttman, 1968; Lingoes, 1973; Roskam & Lingoes, 1970). The latter represents the distances between points in space so that positively correlated items are close together and items that correlate zero or negatively are far apart. Two measures of the adequacy of the solution are provided, both of which the program algorithm attempts to minimize in iterative steps: Kruskal's stress coefficient and the Guttman-Lingoes coefficient of alienation.
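
The first step of this analysis, turning a table of ratings into a criterion-by-criterion correlation matrix, can be sketched as follows. This is an illustration under stated assumptions: the paper does not say which correlation coefficient was used, so Pearson product-moment correlation is assumed here, and the function names and the small ratings table (five tests rated on three criteria) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length rating sequences.

    Assumes neither sequence is constant (nonzero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(ratings):
    """Correlate every criterion (column) with every other.

    ratings -- list of rows, one row per test, one column per criterion
    """
    columns = list(zip(*ratings))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Hypothetical ratings: 5 tests (rows) on 3 criteria (columns).
ratings = [
    [2, 1, 0],
    [4, 2, 1],
    [6, 3, 0],
    [8, 4, 1],
    [10, 5, 0],
]

matrix = correlation_matrix(ratings)
for row in matrix:
    print([round(r, 2) for r in row])
```

With 24 criteria the same procedure yields the 24 x 24 matrix described above, which then serves as input to the scaling program.
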


RESULTS

A solution for three dimensions was selected for presentation (Kruskal's stress = .11, Guttman-Lingoes coefficient of alienation = .12). A plot of two dimensions of this solution (vector 1 against vector 2) is presented in Figure 1. The numbers in Figure 1 correspond to the 24 variables listed in Table 1.


The plot reveals a radex pattern very similar to that obtained by Shani & Petrosko (1976). The plot, generally speaking, shows most variables located in space where they would be expected, based on the theory.

There were several reasons for discrepancies from theoretically predicted locations. First, an obvious reason presents itself - there was an imperfect match between the theoretical conception and the empirical reality. The rational considerations used in constructing the theory were not in all cases borne out by how tests are actually rated on their quality. Secondly, several of the criteria in this analysis were not represented in the analysis of secondary school tests. Such variables were assigned facet profiles based on a more-or-less common sense consideration of Shani's theory. For example, variables 9 through 11 in this study had no exact equivalents among the 25 variables analyzed by Shani and Petrosko (1976).

Figure 1. Radex for 24 test evaluation variables.


DISCUSSION

It is important not to lose sight of the practical implications of this study. Smallest Space Analysis produces configurations that show how variables are interrelated - the closer the relationship, the smaller the distance between them. An inspection of Figure 1 shows which aspects of elementary school tests are related to one another and which are not. Some of the more pertinent results bear discussion.

Note zone a5b?. Many of the variables related to quality of scores and norms were closely related to one another. Based on empirical analysis of an actual population of tests, the quality of norm range (variable 13) was closely related to such things as quality of score gradation (variable 24) and score interpretability (variable 14). Tests strong in one of these areas also tended to be strong in the other areas.

An interesting finding was the great divergence between variable 1, Content/Construct Validity, and variable 2, Concurrent/Predictive Validity. Variable 1 formed the center of the radex and variable 2 landed up toward the top part of area a?b?. In effect, whether a test was judged as adequate in covering the content of a goal area and as having "face valid" items had little to do with the existence of empirical validity studies for the test.

Variable 19, Test-Retest Reliability, and variable 21 were found, as hypothesized, in the a4b2 zone of Empirical Validity and Reliability. However, variable 20, Internal Consistency Reliability, was relatively independent of the other reliability types.


Finally, some of the often neglected aspects of tests - physical arrangement of items, quality of printing - were not related to the central issues of validity and reliability and only somewhat related to one another.

As a concluding note, it might be well to pay heed to the results in terms of practical decisions about tests that many of us make. The quality of a standardized test is not a unitary concept but multivariate. Whether a test might be strong in one type of reliability may have little to do with its strengths in other areas. Mundane, but in some cases crucial, considerations - like a test's format for recording student responses - should be assessed separately from its other characteristics. Especially when a test will be used for a special purpose, or with a special population, it should be judged on many independent criteria.